Migrate to ANTLR v4 in Lucene.Net.Expressions, #977 #996

paulirwin · 2024-10-25T16:36:38Z

You've read the Contributor Guide and Code of Conduct.
You've included unit or integration tests for your change, where applicable.
You've included inline docs for your change, where applicable.
There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Migrate ANTLRv3 to v4, and automate lexer/parser generation.

Fixes #977

Description

The ANTLRv3 NuGet package we depend on has not been maintained, and targets .NET Standard 1.6 which is reporting some vulnerabilities. Additionally, our current codebase has what appears to be hand-ported lexer and parser code from the original Java, which in some cases has been updated manually as well. The original ANTLRv3 grammar is not in our repo. Ideally, we would generate the lexer and parser using ANTLR from the grammar. On top of this, we've had to change some assertions from the original Java due to discrepancies in the implementation.

This adds the 4.8 grammar to the repo, but updates its syntax to use the v4 format. Notably, ANTLRv4 removes the old v3 AST generation and instead lets you walk the parse tree however you see fit. This means that things like "root nodes" (indicated by !) in the old grammar are no longer necessary or supported, and thus require a different approach to walking the syntax tree. Additionally, empty lexer tokens like AT_CALL are no longer supported, so that use was changed into a new call rule instead, with corresponding LUCENENET-specific callout. Error handling is done another way now, resulting in not needing to add inline C# to the grammar. Other than these changes, the grammar is highly similar to the upstream 4.8 v3 grammar. I decided to stick with the 4.8 grammar instead of updating to the most recent grammar which is already in v4 format to keep it as close as possible to the 4.8 code.

The added Antlr4BuildTasks NuGet package (MIT-licensed) generates the lexer and parser code at compile time now, which reduces manual error and shrinks the size of the code we have to maintain. By removing the lexer and parser from the codebase, this PR results in a net reduction of about 3k lines of code. This also is configured to generate a Listener base class, which is the v4 approach that is closest to the existing 4.8 code (as opposed to a Visitor that returns nodes, as the latest Lucene code uses).

The JavascriptCompiler C# code now looks a bit different as a result of implementing the visitor, but this seemed like a cleaner and more maintainable approach than recursively walking every parse tree child. The Context classes that are generated for each rule are now strongly-typed, so you get the benefit of not having to compare rule/token indices. Another goal of this PR is to upgrade this dependency and change the approach in code without changing any existing tests or public API surface. As a result, the listener implementation is a private nested class instead of public, and this also helps keep the logic mostly in line with where it was before (just a little out of order). Note that we are not using the ParseTreeWalker because we want to control the order of entering nodes, so this might look a little different than other implementations of the Listener in ANTLR.

This also fixes some assertions that were changed from the original Java code, because this implementation produces the same results as the Java Lucene 4.8 code. Notably this is due to casting operands to longs before performing shift and bitwise operations, whereas the previous code was erroneously converting all numbers to int/long if there were any shifts or bitwise operations in the source text. This worked fine for some unit test examples, but could quickly break down for anything more complex than that. All numbers are treated as System.Double except for when shifts or bitwise operations are involved, in which case they are casted to Int64/I8 which truncates any decimal place values.

…TLRv3 transitive deps

paulirwin · 2024-11-05T15:11:47Z

@NightOwl888 Rebased off latest master to resolve a conflict, ready for review.

NightOwl888

Thanks for the PR.

This seems to work as good as it did previously, so nice work. The only thing we are missing is the attribution in the LICENSE.txt file, which we need to do each time a file or group of files that is under a different license is added.

src/Lucene.Net.Expressions/JS/Javascript.g4

…script.g4. Normalized dividers between different license attributions.

src/Lucene.Net.Expressions/JS/Javascript.g4

paulirwin mentioned this pull request Oct 25, 2024

ci: Do not run actions on draft PRs #722

Closed

paulirwin added this to the 4.8.0-beta00018 milestone Oct 25, 2024

paulirwin marked this pull request as ready for review October 27, 2024 20:33

paulirwin mentioned this pull request Oct 29, 2024

Automate Generation of QueryParser to C# #268

Open

paulirwin requested a review from NightOwl888 October 29, 2024 17:25

paulirwin added 5 commits November 5, 2024 08:01

Migrate to ANTLR v4 (WIP), apache#977

88108d1

Fix bug where logical_and and logical_or rules were always emitting IL

630942c

Fix failing unit tests with ANTLRv4 port, except Api style ones

cda4c6b

Exclude generated code from ApiScanTestBase tests

905caaf

Make Antlr4BuildTasks a build-time-only dependency, and remove old AN…

7570539

…TLRv3 transitive deps

paulirwin force-pushed the issue/977 branch from 7341c7e to 7570539 Compare November 5, 2024 15:09

NightOwl888 requested changes Nov 5, 2024

View reviewed changes

src/Lucene.Net.Expressions/JS/Javascript.g4 Show resolved Hide resolved

NightOwl888 added the notes:improvement An enhancement to an existing feature label Nov 10, 2024

LICENSE.txt: Added BSD license for src/Lucene.Net.Expressions/JS/Java…

407293b

…script.g4. Normalized dividers between different license attributions.

NightOwl888 self-requested a review November 11, 2024 09:39

NightOwl888 approved these changes Nov 11, 2024

View reviewed changes

src/Lucene.Net.Expressions/JS/Javascript.g4 Show resolved Hide resolved

paulirwin merged commit e05e0ff into apache:master Nov 11, 2024
199 checks passed

paulirwin deleted the issue/977 branch November 11, 2024 15:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate to ANTLR v4 in Lucene.Net.Expressions, #977 #996

Migrate to ANTLR v4 in Lucene.Net.Expressions, #977 #996

paulirwin commented Oct 25, 2024 •

edited

Loading

paulirwin commented Nov 5, 2024

NightOwl888 left a comment

Migrate to ANTLR v4 in Lucene.Net.Expressions, #977 #996

Migrate to ANTLR v4 in Lucene.Net.Expressions, #977 #996

Conversation

paulirwin commented Oct 25, 2024 • edited Loading

Description

paulirwin commented Nov 5, 2024

NightOwl888 left a comment

Choose a reason for hiding this comment

paulirwin commented Oct 25, 2024 •

edited

Loading